Abstract

We obtain an index of the complexity of a random sequence by allowing the
role of the measure in classical probability theory to be played by a function we
call the generating mechanism. Typically, this generating mechanism will be a
finite automaton. We generate a set of biased sequences by applying a finite-state
automaton with a specified number, m, of states to the set of all binary sequences.
Thus we can index the complexity of our random sequence by the number of states
of the automaton. We detail optimal algorithms to predict sequences generated in
this way.
1 Generating Mechanisms

We explore a finite setting for the problem of prediction. In particular we are
interested in an index of the complexity of a random sequence. In this paper, the
role of the measure in classical probability theory will be played by a function we
call the generating mechanism. Typically, this generating mechanism will be a finite
automaton. We generate a set of biased sequences by applying a finite-state automaton
with a specified number, m, of states to the set of all binary sequences. Thus we
can index the complexity of our random sequence by the number of states of the
automaton.

We will show the prediction algorithms which minimise average error for varying 
degrees of knowledge about the generating mechanism. We will then show how the 
index of complexity used can enable us to consider the batch setting - how best to 
predict after exposure to a given set of training data. This allows an interpretation 
of Occam's razor - when and how simpler predictors are better. 

Finally we discuss the case of prediction with restricted resources, again utilizing 
the number of states of the generating mechanism as our index of complexity. 
Figure 1: An example of the type of finite automaton known as a Mealy machine, with
7 states. The active state is initially $S_0$. It changes according to an input sequence; for
example, 001111 would cause the following order of states to be active: $S_0 S_1 S_1 S_5 S_j S_0 S_2$,
and the output sequence would be 000100.

2 Mathematical Setting. 

We consider the set of all length-t binary sequences, $S^t = \{0,1\}^t$, which we call
the generating sequences. We consider them acted upon by a particular finite-state
automaton, G, which we will call the generating mechanism. We define a finite
automaton as follows:

Definition 2.1. A finite automaton is a system consisting of a set of states S, a
transition function $f : S \times \{0,1\} \to S$, and an output function $g : S \times \{0,1\} \to \{0,1\}$,
together with an element of S designated as the 'active state', initially labelled
$S_0$. Upon receiving a binary input sequence, the active state changes as specified
by the transition function, and at each transition the automaton produces an output
as specified by the output function. See Figure 1.

For more on finite automata, see any introductory textbook, e.g. [3].

Example 2.2. 1. A ring automaton that creates a periodic sequence out of any
input sequence. $G(S^t)$ contains only one element.

2. A shift automaton that maps all sequences to a shifted version, e.g. 010111 goes
to 0010111. This can be implemented in two states.
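
To make the definition concrete, the following Python sketch simulates a finite automaton of the kind described in Definition 2.1. The representation (a dictionary `transitions[state][input] = (next_state, output)`), the function name `run_mealy` and the two-state delay machine are our own illustrative assumptions, not taken from the figures. The delay machine is one possible reading of the shift automaton of Example 2.2: because one digit is emitted per input digit, the input 010111 yields the first six digits of the shifted sequence.

```python
from typing import Dict, List, Tuple

# transitions[state][input_bit] = (next_state, output_bit)
Transitions = Dict[int, Dict[int, Tuple[int, int]]]


def run_mealy(transitions: Transitions, start: int,
              inputs: List[int]) -> Tuple[List[int], List[int]]:
    """Return (sequence of active states, output sequence) for a binary input."""
    state = start
    states = [state]
    outputs = []
    for bit in inputs:
        state, out = transitions[state][bit]
        states.append(state)
        outputs.append(out)
    return states, outputs


# Hypothetical two-state delay machine: the active state remembers the last
# input digit, and each transition emits the digit that was remembered.
delay: Transitions = {
    0: {0: (0, 0), 1: (1, 0)},
    1: {0: (0, 1), 1: (1, 1)},
}

states, outputs = run_mealy(delay, 0, [0, 1, 0, 1, 1, 1])
print(states)   # [0, 0, 1, 0, 1, 1, 1]
print(outputs)  # [0, 0, 1, 0, 1, 1]
```

In this sketch the final active state (1) holds the last input digit, which is the digit that a seventh output would carry.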

This particular finite-state automaton generates a new set, the set of output
sequences, $G(S^t)$. In $G(S^t)$, particular sequences may appear several times, or not
at all. We consider possible algorithms, $p(G(S^t))$, for predicting the $i$th element of
a sequence in $G(S^t)$ given all elements up to and including the $(i-1)$th. We answer the
following question under certain conditions: "After observing a sequence $G(g)$ in $G(S^t)$
up to time t, what is the best way of predicting the next element in the sequence?" See
Figure 2.

Figure 2: The prediction setup. $S^t$ is the set of binary sequences of length t. $G(S^t)$ is
the set of possible outputs of the automaton G. $G(g)$ is a particular element in $G(S^t)$. In
a prediction setting, $G(g)$ represents the observed data. We try to predict $G(g)_t$ given
$G(g)_1 \ldots G(g)_{t-1}$ in the best possible way; specifically, we design a prediction algorithm
to minimize the error metric.

In this article, we define the optimal prediction algorithm, p, in several cases, 
using the average error as the metric of performance (we define the average error 
below). We calculate the average error associated with these cases as a function 
of the structure of the generating mechanism(s) involved. Specifically we deal with 
the following cases in order: 

• We know the structure and active state of G at all times t.

• We know the structure of G, but have no information as to which state is active.

• We know that G is one finite automaton from a known set of finite automata.

We then proceed to the case where we have restricted resources. That is, we are
predicting a mechanism which could have up to m states, using an automaton with
k states or fewer. How best should we predict the output of such a mechanism, when
we don't know the generating sequence?

First we consider metrics for the performance of any prospective predictor. 
3 Measuring the Error

We consider two measures of the error of an arbitrary prediction algorithm applied
to elements of $G(g)$. First define the error:

$$E^t(g) = \sum_{i=1}^{t} G(g)_i \oplus p(G(g))_i, \qquad (1)$$

where $\oplus$ denotes binary summation modulo 2. Then the average error is

$$E_{ave} := \frac{1}{2^t} \sum_{g \in S^t} \frac{1}{t} \sum_{i=1}^{t} G(g)_i \oplus p(G(g))_i, \qquad (2)$$

and the worst case error is

$$E_{max} := \max_{g \in S^t} E^t(g). \qquad (3)$$

We could consider other metrics to optimise. For example, prediction paths with error
above a specified fraction could count as unacceptable and those with error below it as
acceptable; one would then seek a predictor which maximises the total count of
acceptable sequences. Here we only consider the average error, $E_{ave}$.
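
For small t, both error measures can be evaluated by direct enumeration of the $2^t$ generating sequences. The sketch below is only an illustration under our own assumptions: a predictor is modelled as a hypothetical function `predictor(history)` that receives the outputs seen so far and returns 0 or 1, and the one-state machine that copies its input to its output is an invented example.

```python
from itertools import product


def outputs(trans, start, g):
    """Output sequence G(g); trans[state][input] = (next_state, output)."""
    state, out = start, []
    for bit in g:
        state, o = trans[state][bit]
        out.append(o)
    return out


def errors(trans, start, predictor, g):
    """E^t(g): number of positions where the prediction differs from the output."""
    out = outputs(trans, start, g)
    return sum(out[i] ^ predictor(out[:i]) for i in range(len(out)))


def average_and_worst_error(trans, start, predictor, t):
    """Return (average error, worst-case error) over all 2^t generating sequences."""
    counts = [errors(trans, start, predictor, g)
              for g in product((0, 1), repeat=t)]
    e_ave = sum(c / t for c in counts) / 2 ** t
    return e_ave, max(counts)


# Hypothetical one-state machine that copies each input digit to the output.
copy_machine = {0: {0: (0, 0), 1: (0, 1)}}


def always_zero(history):
    """A trivial predictor that ignores the history."""
    return 0


print(average_and_worst_error(copy_machine, 0, always_zero, 4))  # (0.5, 4)
```

The enumeration costs $2^t$ machine runs, so it is only practical as a check on small cases.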



3.1 Perfect Knowledge - known active state and structure

Suppose we know the structure and active state of G. We are still only able to
determine the output digit from the generating mechanism for certain situations.
Every active state has a transition from it corresponding to an input of 0, and a
transition corresponding to an input of 1. Each transition produces an output of
either 0 or 1, and thus we have four possible situations. We label them by their
output digits: $L_{00}, L_{01}, L_{10}, L_{11}$. See Figure 3.

For situations $L_{00}$ and $L_{11}$, whatever the next digit of the generating sequence, we
can be sure about the next output digit. We call states of these types, with output
transitions of case $L_{00}$ or case $L_{11}$, biased states. For $L_{01}$ and $L_{10}$, we will be
wrong for one possible generating sequence digit, and correct for the other.

Thus even if we know the active state, and the structure of the mechanism, the best we
can predict depends on the frequency of occurrence of biased states. If the number
of times a state s is active over the first t timesteps of g is $a_t(s)$, say, then

$$E^t(g) = \sum_{\{s \,:\, s \text{ is unbiased}\}} a_t(s). \qquad (4)$$

The frequency of occurrence of a particular state $s \in G$, over the first t digits of
a sequence $g \in S^t$, is defined as

$$f^t(s,g) := \frac{1}{t} a_t(s). \qquad (5)$$
Figure 3: States can have one of four different input output combinations. In the figure
the transitions are labelled (input, output). If we know the active state of a generating
mechanism is of type $L_{00}$ or $L_{11}$ then we can be sure of the next output. In the two other
situations (the unbiased states) the output digit will depend on the input.

We get the average frequency of occurrence by averaging this over the set $S^t$ and
taking the limit:

$$f(s) := \lim_{t \to \infty} \frac{1}{2^t} \sum_{g \in S^t} f^t(s,g). \qquad (6)$$

Thus if we always know the active state, the average long term error, $E(G)$, will be:

$$E(G) = \sum_{\text{unbiased states } s} f(s), \qquad (7)$$

which is an upper bound on the average error of any prediction algorithm.
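
The classification of states and the quantities $a_t(s)$ and $f^t(s,g)$ can be computed directly for a given machine and generating sequence. The following sketch uses a hypothetical two-state machine and our own function names; it also assumes that a state counts as active at the moment a digit of g is read, which is one possible convention for "the first t timesteps".

```python
from collections import Counter


def state_type(trans, s):
    """Label state s by its output on input 0 and on input 1, e.g. 'L01'."""
    return "L%d%d" % (trans[s][0][1], trans[s][1][1])


def is_biased(trans, s):
    """Biased states emit the same digit whatever the input (L00 or L11)."""
    return state_type(trans, s) in ("L00", "L11")


def activity_counts(trans, start, g):
    """a_t(s): number of the t steps of g at which s is the active state.

    Assumption: a state counts as active at the moment a digit is read.
    """
    counts, state = Counter(), start
    for bit in g:
        counts[state] += 1
        state = trans[state][bit][0]
    return counts


def unbiased_activity(trans, start, g):
    """Right-hand side of equation (4): total activity of the unbiased states."""
    counts = activity_counts(trans, start, g)
    return sum(n for s, n in counts.items() if not is_biased(trans, s))


# Hypothetical machine: trans[state][input] = (next_state, output).
machine = {
    0: {0: (0, 0), 1: (1, 1)},  # type L01, unbiased
    1: {0: (0, 1), 1: (1, 1)},  # type L11, biased
}
g = [1, 1, 0, 1]
print([state_type(machine, s) for s in machine])   # ['L01', 'L11']
print(unbiased_activity(machine, 0, g))            # 2
print(activity_counts(machine, 0, g)[0] / len(g))  # f^t(0, g) = 0.5
```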

3.1.1 Calculation of state frequencies, f(s), for certain machine structures

We represent some of the information contained in the transition function of G by
an adjacency matrix A, with $A_{ij}$ being the number of possible transitions from
state i to state j. Thus A contains entries of either 0, 1 or 2. We can determine $f(s)$
from knowledge of the adjacency matrix A of the mechanism G.

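The number of paths leading from state i to state j in t steps is given by the $ij$th
entry of the $t$th power of the adjacency matrix. Of the $2^t$ generating sequences,
$2^{t-i} (A^i)_{S_0,s}$ therefore leave state s active after i steps, and thus:

$$\frac{1}{2^t} \sum_{g \in S^t} f^t(s,g) = \frac{1}{2^t} \, \frac{1}{t} \sum_{g \in S^t} a_t(s) = \frac{1}{t} \sum_{i=1}^{t} \frac{1}{2^i} (A^i)_{S_0,s}. \qquad (10)$$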



Now the adjacency matrix of a mechanism G can be normalised by a factor
of 1/2, and this normalised adjacency matrix has rows which sum to 1. Call the
normalised matrix N. We thus examine the limit:

$$f(s) = \lim_{t \to \infty} \frac{1}{t} \sum_{i=1}^{t} (N^i)_{S_0,s}. \qquad (11)$$



We now borrow a standard result from the theory of Markov chains (see any
introductory text on the subject, e.g. [BJ). If N is irreducible and aperiodic, the
limit operation

$$\lim_{t \to \infty} (N^t)_{S_0,s} = \pi_s \qquad (12)$$

defines a stationary vector $\pi$, and this vector is the left eigenvector of N belonging
to its largest eigenvalue, 1 (with entries summing to 1). One can show that this
result implies

$$\lim_{t \to \infty} \frac{1}{t} \sum_{i=1}^{t} (N^i)_{S_0,s} = \pi_s. \qquad (13)$$

Thus if N is irreducible and aperiodic, $f(s)$ can be calculated by determining the
largest eigenvector of the normalised adjacency matrix of the generating mechanism.
It remains to prove that time averaging allows us to drop the aperiodic condition.
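
As a numerical illustration, the time average of the powers of the normalised adjacency matrix can be approximated by repeated multiplication, without an eigenvector routine. The sketch below is written under our own assumptions: states are numbered 0 to k-1, the function names are hypothetical, and the number of steps is an arbitrary choice. The two-state ring used as an example is periodic; in this particular case the time average still settles at 1/2 for each state, which is an observation about one example and not a proof of the general statement.

```python
def normalised_adjacency(trans):
    """N[i][j] = (number of transitions i -> j) / 2, for states 0..k-1."""
    k = len(trans)
    n = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for bit in (0, 1):
            n[i][trans[i][bit][0]] += 0.5
    return n


def state_frequencies(trans, start, steps=2000):
    """Approximate f(s) by (1/t) * sum_{i=1..t} (N^i)[start][s] with t = steps."""
    n = normalised_adjacency(trans)
    k = len(n)
    row = [1.0 if j == start else 0.0 for j in range(k)]
    total = [0.0] * k
    for _ in range(steps):
        # One step of the chain: row vector times N.
        row = [sum(row[i] * n[i][j] for i in range(k)) for j in range(k)]
        total = [a + b for a, b in zip(total, row)]
    return [x / steps for x in total]


# Hypothetical two-state ring: every input moves to the other state.
ring = {0: {0: (1, 0), 1: (1, 0)}, 1: {0: (0, 1), 1: (0, 1)}}
print(state_frequencies(ring, 0))  # [0.5, 0.5]
```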



4 A known Mechanism, but with unknown active state

Suppose we wish to predict an output sequence, $G(g)$, at time t, given only the
structure of G, its initial state and the observed data sequence $G(g)$ up to time
t-1.
4.0.2 Optimal prediction

We now detail the optimal prediction algorithm in this case. First we make the
following definition:

Definition 4.1. Given a generating machine G, we say a generating sequence g is
consistent up to time t with output sequence $G(g')$ if the first t digits of $G(g)$ agree
with $G(g')$. We also say that sequences g and g' are consistent with each other if
they are both consistent with the same output sequence. Because consistency is an
equivalence relation, we can partition the set of generating sequences into sets
$C^t(g)$ defined by the output sequence $G(g)$. We call these sets the consistency
classes: each sequence in a consistency class is consistent with all other sequences
in that class.

Now, we can write the average error

$$E_{ave} = \frac{1}{2^t} \sum_{g \in S^t} \frac{1}{t} \sum_{i=1}^{t} G(g)_i \oplus p(G(g))_i \qquad (14)$$

as a sum over the consistency classes:

$$E_{ave} = \frac{1}{2^t} \sum_{c \in C} \sum_{g \in c} \frac{1}{t} \sum_{i=1}^{t} G(g)_i \oplus p(G(g))_i. \qquad (15)$$

We note that because the observed data, $G(g)$, is the same for all g in a consistency
class, the prediction p will be identical for all elements in the class. If we desire to
choose our predictor, p, in order to minimize the average error, then for each class,
$p(G(g)_1 \ldots G(g)_t)$ should be 0 if $G(g)_{t+1} = 0$ more often than $G(g)_{t+1} = 1$. Vice
versa, if $G(g)_{t+1} = 1$ more often than $G(g)_{t+1} = 0$, then $p(G(g)_1 \ldots G(g)_t)$ should
be 1.

More precisely, let the number of g for which $G(g)_{t+1} = 0$ be $\#p_{t+1}$. Let the
number of g for which $G(g)_{t+1} = 1$ be $\#q_{t+1}$. We can determine these quantities
from the knowledge of the location of the active states for each generating sequence
within a consistency class. Then:

$$\#p_{t+1} = |L_{00}| + |L_{01}| + |L_{10}| \qquad (16)$$

$$\#q_{t+1} = |L_{11}| + |L_{01}| + |L_{10}| \qquad (17)$$

Now the combined error

$$\sum_{G(g)_{t+1}=0} G(g)_{t+1} \oplus p(G(g))_{t+1} + \sum_{G(g)_{t+1}=1} G(g)_{t+1} \oplus p(G(g))_{t+1}$$

$$= \sum_{G(g)_{t+1}=0} 0 \oplus p(G(g))_{t+1} + \sum_{G(g)_{t+1}=1} 1 \oplus p(G(g))_{t+1}$$

$$= \begin{cases} |\{g \in C^t(g) : G(g)_{t+1} = 1\}| & \text{if } p(G(g))_{t+1} = 0, \\ |\{g \in C^t(g) : G(g)_{t+1} = 0\}| & \text{if } p(G(g))_{t+1} = 1. \end{cases}$$
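Thus to minimize the error, we define:

$$p(G(g)_1 \ldots G(g)_t) = \begin{cases} 0 & \text{if } \#p_{t+1} > \#q_{t+1}, \\ 1 & \text{if } \#p_{t+1} < \#q_{t+1}. \end{cases} \qquad (18)$$

When the two counts are equal, either prediction incurs the same error. Then

$$\sum_{G(g)_{t+1}=0} G(g)_{t+1} \oplus p(G(g))_{t+1} + \sum_{G(g)_{t+1}=1} G(g)_{t+1} \oplus p(G(g))_{t+1} = \min\{\#p_{t+1}, \#q_{t+1}\}.$$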
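
The majority rule can be checked on small cases by brute force: enumerate every generating sequence of length t+1, keep those whose first t outputs agree with the observed data, and count the two possible next digits. The sketch below is an illustration under our own assumptions (hypothetical machine and function names, and a tie broken in favour of 1); it counts generating sequences of length t+1 directly and does not rely on any closed-form expression for the two counts.

```python
from itertools import product


def run(trans, start, g):
    """Output sequence G(g); trans[state][input] = (next_state, output)."""
    state, out = start, []
    for bit in g:
        state, o = trans[state][bit]
        out.append(o)
    return out


def optimal_prediction(trans, start, observed):
    """Predict digit t+1 by majority vote over consistent generating sequences.

    Returns (prediction, count_next_0, count_next_1). Ties are broken in
    favour of 1, which is an arbitrary choice made for this sketch.
    """
    t = len(observed)
    p_count = q_count = 0
    for g in product((0, 1), repeat=t + 1):
        out = run(trans, start, g)
        if out[:t] == list(observed):
            if out[t] == 0:
                p_count += 1
            else:
                q_count += 1
    prediction = 0 if p_count > q_count else 1
    return prediction, p_count, q_count


# Hypothetical machine: state 0 is unbiased (L01), state 1 is biased (L11).
machine = {
    0: {0: (0, 0), 1: (1, 1)},
    1: {0: (0, 1), 1: (1, 1)},
}
print(optimal_prediction(machine, 0, [1]))     # (1, 0, 2)
print(optimal_prediction(machine, 0, [0, 0]))  # (1, 1, 1)
```

In the first call the only consistent history leaves the biased state active, so the next digit is certain; in the second the active state is unbiased and the minimum of the two counts, 1, is the unavoidable error.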



4.0.3 Average Error 

Given the structure of G, can we determine the average error in a similar fashion 
to the case where we always knew the active state? 



$$E_{ave} = \frac{1}{2^t} \sum_{g \in S^t} \frac{1}{t} \sum_{i=1}^{t} G(g)_i \oplus p(G(g))_i \qquad (19)$$

To calculate the best prediction we note we only require the knowledge of the
following:

Definition 4.2. The consistency vector of an output sequence $G(g)$ is a size k
vector, where the $i$th entry contains the number of generating sequences g active at
state i which are consistent with $G(g)$.

From the structure of G we can define two matrices B, C which evolve the
consistency vector under the inputs of 0 and 1 respectively. We note that B + C = A.

We note that one can reach the same ratio of active states (and thus make the
same prediction) by more than one generating sequence. Our predictions and errors
are only determined by the ratio of generating sequences active at states $s_1$ to $s_k$.
Thus we can be somewhat more accurate with our choice of equivalence classes. We
can define a space of all possible ratios, ratio space. If we know the time average of
the number of sequences active at each point in ratio space, then we can calculate
the average error.

We have not yet determined this time average, and thus determining a closed
form for the limit of $E_{ave}$ as t tends to infinity, in terms of the adjacency matrix of
the generating mechanism, is a problem that remains to be solved.
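
The consistency vector of Definition 4.2 lends itself to a simple update rule, sketched below. This is one possible reading and not necessarily the construction intended in the text: we assume the adjacency matrix is split according to the digit emitted on each transition (which is what the observed data reveal), so that the two parts still sum to A. States are assumed to be numbered 0 to k-1, and all names are hypothetical.

```python
def output_matrices(trans):
    """m[o][i][j] = number of transitions i -> j that emit digit o.

    Assumption: the split is by emitted digit; m[0] + m[1] equals A.
    """
    k = len(trans)
    m = [[[0] * k for _ in range(k)] for _ in (0, 1)]
    for i in range(k):
        for bit in (0, 1):
            j, o = trans[i][bit]
            m[o][i][j] += 1
    return m


def step(vector, matrix):
    """Row vector times matrix."""
    k = len(vector)
    return [sum(vector[i] * matrix[i][j] for i in range(k)) for j in range(k)]


def predict_with_consistency_vector(trans, start, observed):
    """Return (prediction, consistency vector after the observed digits)."""
    m = output_matrices(trans)
    v = [1 if s == start else 0 for s in range(len(trans))]
    for o in observed:
        v = step(v, m[o])          # keep only sequences consistent with o
    p_count = sum(step(v, m[0]))   # continuations whose next output is 0
    q_count = sum(step(v, m[1]))   # continuations whose next output is 1
    return (0 if p_count > q_count else 1), v


# Hypothetical machine: trans[state][input] = (next_state, output).
machine = {
    0: {0: (0, 0), 1: (1, 1)},
    1: {0: (0, 1), 1: (1, 1)},
}
print(predict_with_consistency_vector(machine, 0, [1]))  # (1, [0, 1])
```

Each update costs on the order of $k^2$ operations, compared with the $2^t$ machine runs needed to enumerate generating sequences explicitly.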
5 Selecting from a set of automata

We now consider the setting where we do not know the particular generating
mechanism; we only know that it is a member of a prescribed finite set of generating
mechanisms $\{G_1, \ldots, G_n\}$. In this case, the best prediction algorithm we can use is
the same as in the previous case of a known mechanism but unknown active state. We
replace the tracking of all possible consistent generating sequences with the
tracking of all consistent $(G_i, g)$ pairs. At a given timestep, we make our prediction by
comparing the number of $(G_i, g)$ pairs which predict 0 with the number that predict
1. We predict 0 if the number predicting 0 is larger than the number predicting
1, and we predict 1 otherwise. The error associated with such a prediction will be
the minimum of these numbers.

We conjecture that the asymptotic error $E_{ave}$ of this situation will be the same
as in the previous case. We also note that this algorithm may have significant
computational cost. If there are symmetries in the set $\{G_1, \ldots, G_n\}$ then we may be
able to increase the speed of this exhaustive search algorithm significantly (and
possibly perform nearly as well).
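
The pair-counting rule can be sketched by brute force for short observations. The code below is illustrative only: the two one-state mechanisms, the function names and the representation of a mechanism as a `(transitions, start_state)` tuple are our own assumptions.

```python
from itertools import product


def run(trans, start, g):
    """Output sequence G(g); trans[state][input] = (next_state, output)."""
    state, out = start, []
    for bit in g:
        state, o = trans[state][bit]
        out.append(o)
    return out


def predict_from_set(mechanisms, observed):
    """Vote over all consistent (mechanism, generating sequence) pairs.

    mechanisms is a list of (transitions, start_state) tuples.
    Returns (prediction, [pairs predicting 0, pairs predicting 1]).
    """
    t = len(observed)
    votes = [0, 0]
    for trans, start in mechanisms:
        for g in product((0, 1), repeat=t + 1):
            out = run(trans, start, g)
            if out[:t] == list(observed):
                votes[out[t]] += 1
    return (0 if votes[0] > votes[1] else 1), votes


# Two hypothetical one-state mechanisms.
copy_machine = {0: {0: (0, 0), 1: (0, 1)}}  # output equals input
ones_machine = {0: {0: (0, 1), 1: (0, 1)}}  # output is always 1
mechanisms = [(copy_machine, 0), (ones_machine, 0)]
print(predict_from_set(mechanisms, [1, 1]))  # (1, [1, 9])
```

After observing 11, only two generating sequences remain consistent with the copying mechanism, whereas all eight remain consistent with the constant mechanism, so the vote is dominated by the latter.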

6 The batch setting - Occam's Razor on a finite set of mechanisms

We now consider a batch setting: given the performance of different predictors
over a training data set, how should one choose the predictor with the best
performance on future data? Even if we have a predictor which makes no error over
the data set, this does not guarantee anything about the performance of that same
predictor over future data. We set up the problem precisely as follows:

We have an unknown generating mechanism, $G_m$, in some finite set of mechanisms,
$\{G_1, \ldots, G_n\}$. $G_m$ produces a particular output sequence for a given generating
sequence g, which we call the training data: $o_1 \ldots o_t$. We have a set of predictors
$P = \{P_i\}$. Given the number of errors each predictor, $P_i$, makes on the training
data, how does one pick the predictor with minimum error on continuations of the
data: the sequence $o_{t+1} \ldots o_{t'}$?

The first step is to collect information about the 'likelihood' for each generating
mechanism that might have been responsible for the training data. We represent
this as a set of (generating sequence, generating mechanism) pairs, $(g, G)$, whose
output results in the training data: $G(g) = o_1 \ldots o_t$.

Any generating sequence could be responsible for the continuation of the data.
However, we know that only a certain set of (generating sequence, generating
mechanism) pairs could have resulted in the training data. For each predictor, we can
calculate the asymptotic performance of the predictor applied to a particular
generating mechanism. We make the following definition:

Definition 6.1. The average error of a predictor P with respect to G at time t is
the average number of errors made by P when trying to predict an output sequence
$G(g)$, averaged over all generating sequences g of length t:

$$E^t(P, G) := \frac{1}{2^t} \sum_{g \in S^t} \frac{1}{t} \sum_{i=1}^{t} P(G(g)_1 \ldots G(g)_{i-1}) \oplus G(g)_i. \qquad (20)$$



We can determine this quantity for each P by an exhaustive search over all g.
We consider the average performance of a predictor $P_k$ over this set of pairs,

$$\frac{1}{|\Gamma|} \sum_{(g, G_m) \in \Gamma} E(P_k^{g}, G_m), \qquad \Gamma = \{(g, G_m) : G_m(g) = o_1 \ldots o_t\}, \qquad (21)$$

where we include the generating sequence in the definition of the predictor, because
at time t it defines the starting state of $G_m$.

Finding the $P_k$ which minimizes quantity (21) gives the best predictor to use.
This predictor may not be the one with the best performance over the training data.
One can observe that we do not use the number of errors that each predictor makes
on the training data in the calculation directly.

Again there are opportunities for implementing these algorithms more efficiently 
by using symmetries. 

We can also calculate this quantity for types of predictors other than automata, 
for example decision trees. 
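
The selection procedure of this section can be sketched by exhaustive search on small cases. The code below is an illustration under stated assumptions and not a definitive implementation: the asymptotic performance is replaced by the mean error rate over a short, finite horizon of continuations; a predictor is a hypothetical function of the full output history; the mechanisms, predictors and names are invented; and the training data are assumed to be producible by at least one mechanism.

```python
from itertools import product


def run(trans, state, g):
    """Return (final_state, outputs) after feeding g from the given state."""
    out = []
    for bit in g:
        state, o = trans[state][bit]
        out.append(o)
    return state, out


def consistent_pairs(mechanisms, training):
    """(mechanism index, state reached) for every pair (g, G_m) with G_m(g) = training."""
    pairs = []
    for m, (trans, start) in enumerate(mechanisms):
        for g in product((0, 1), repeat=len(training)):
            state, out = run(trans, start, g)
            if out == list(training):
                pairs.append((m, state))
    return pairs


def future_error(predictor, trans, state, training, horizon):
    """Mean error rate over all continuations of length `horizon` (assumed finite)."""
    total = 0.0
    for g in product((0, 1), repeat=horizon):
        _, out = run(trans, state, g)
        history, errors = list(training), 0
        for o in out:
            errors += predictor(history) ^ o
            history.append(o)
        total += errors / horizon
    return total / 2 ** horizon


def select_predictor(predictors, mechanisms, training, horizon=3):
    """Return (index of best predictor, average future error of each predictor)."""
    pairs = consistent_pairs(mechanisms, training)
    scores = []
    for p in predictors:
        errs = [future_error(p, mechanisms[m][0], s, training, horizon)
                for m, s in pairs]
        scores.append(sum(errs) / len(pairs))
    best = min(range(len(predictors)), key=scores.__getitem__)
    return best, scores


copy_machine = {0: {0: (0, 0), 1: (0, 1)}}  # hypothetical: output equals input
ones_machine = {0: {0: (0, 1), 1: (0, 1)}}  # hypothetical: output is always 1
mechanisms = [(copy_machine, 0), (ones_machine, 0)]
predictors = [lambda history: 0, lambda history: 1]
best, scores = select_predictor(predictors, mechanisms, [1, 1], horizon=2)
print(best)  # 1
```

Note that the scores depend on the training data only through the set of consistent pairs, in line with the observation that the training errors of each predictor do not enter the calculation directly.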



7 Restricted numbers of states

In the above setting, we have assumed that whilst we have a restricted number of
predictors, we have an infinite amount of resources to allow us to make the best
choice. We now consider the problem where the resources with which we implement
and select the model are restricted. That is, we have X memory states for both
determining the best predicting algorithm and implementing it.

If one takes the resource restriction to be represented by the number of states in
an automaton, then we have set this problem up as one of finding the 'best' predicting
automaton with a restricted number of states.

We must now specify our method of selecting the best automaton before we see the
training data; that is, the automaton itself must be chosen in advance. The individual
predictors are encapsulated in the structure of the single automaton chosen as our
best method. After moving around according to the training data, these states must
then perform well on the actual data.

We can conduct an exhaustive search to find these optimal automata for finite
values of t.
8 Comments

Utilising the number of states of a finite automaton as an index of the complexity of
a random sequence allows one to ask quantitative questions about prediction with
restricted resources. Other indices are certainly possible.

We are also interested in understanding how well one can predict a k-state
automaton with an automaton of m < k states. Related work has been done; see, for
example, Meron and Feder's paper "Finite-memory universal prediction of individual
sequences" [5]. We would like to see this extended to a formula describing how well
one can predict an unknown automaton of size k with automata of size m < k.

We note that numerical application of these algorithms is computationally
intensive. For example, the number of binary automata grows quickly with the number
of states, k, e.g. as $(2k)^{2k}/k!$; see [1], [2]. See also [7] for the enumeration of
strongly connected automata (any state is accessible from any other state). Speeding
up these kinds of exhaustive searches is of great interest. We note Helmbold and
Schapire's work [3] on efficient approximation of a prediction algorithm using the
symmetries of underlying predictors. We speculate that a similar result may be
applicable to finite automata.

This research was made possible by funding from Science Foundation Ireland 
through MACSI, and programme 06/IN. 1/1366. We would like to thank V. Vovk 
for his comments. 